The decision making behaviors of humans and animals adapt and then satisfy an "operant
matching law" in certain types of tasks. This was first pointed out by Herrnstein in his
foraging experiments on pigeons. The matching law has been one landmark for elucidating 
the underlying processes of decision making and its learning in the brain. An interesting 
question is whether decisions are made deterministically or probabilistically. Conventional 
learning models of the matching law are based on the latter idea; they assume that 
subjects learn choice probabilities of respective alternatives and decide stochastically with 
the probabilities. However, it is unknown whether the matching law can be accounted 
for by a deterministic strategy or not. To answer this question, we propose several 
deterministic Bayesian decision making models that have certain incorrect beliefs about an 
environment. We claim that a simple model produces behavior satisfying the matching law 
in static settings of a foraging task but not in dynamic settings. We found that the model 
that has a belief that the environment is volatile works well in the dynamic foraging task 
and exhibits undermatching, which is a slight deviation from the matching law observed in 
many experiments. This model also demonstrates the double-exponential reward history 
dependency of a choice and a heavier-tailed run-length distribution, as has recently been 
reported in experiments on monkeys. 

Keywords: decision making, operant matching law, Bayesian inference, dynamic foraging task, heavy-tailed 
reward history dependency 



1. INTRODUCTION 

Does the brain play dice? This is a controversial question about 
the underlying processes of the brain in making a choice from 
several alternatives: Does the brain decide deterministically with 
some internal decision variables? Or does it calculate the proba- 
bility of choosing individual alternatives and cast a "biased die" 
(Sugrue et al., 2005)? Our everyday experience suggests the former
strategy. However, observing a sequence of decisions in a repetitive
task makes it possible to think that choices emerge probabilistically.
Herrnstein conducted a foraging experiment in which a pigeon was placed
in a box equipped with two keys, and key presses were rewarded
according to concurrent variable-interval schedules. He found a
relationship between rewards and choices known as the "operant 
matching law" (Herrnstein, 1961). The law states that the frac- 
tion of the number of times one alternative is chosen against the 
total number of choices matches the fraction of the cumulative 
reward obtained from the alternative against the total reward. 
Behaviors satisfying the law have been observed in a variety of 
task paradigms and across species (de Villiers and Herrnstein, 
1976; Gallistel, 1994; Anderson et al., 2002). Several learning
models have been proposed to account for matching behavior
(Corrado et al., 2005; Lau and Glimcher, 2005; Loewenstein and
Seung, 2006; Soltani and Wang, 2006; Sakai and Fukai, 2008a; 
Simen and Cohen, 2009). These models have a commonality in 
that a model learns the probabilities of choosing each alternative 
directly, and then a choice is made stochastically. However, it is 
yet unknown whether matching behaviors can be accounted for 
by a deterministic model. 

Here, we propose deterministic Bayesian decision making 
models for a two-alternative choice task. Our models stand on the 
incorrect but conceivable postulate that animals have a belief that 
the choice made in one trial does not affect a reward in subsequent 
trials. The models estimate the unknown reward probabilities 
for each alternative and deterministically choose the alternative 
that has the highest reward probability according to the
winner-take-all principle. We first study a model with the belief that the
environment does not change. Note that this is an extension of
the fixed belief model (FBM) (Yu and Cohen, 2009) for the two- 
alternative choice task. We demonstrate that this model satisfies 
the matching law in a steady state in static foraging tasks, in 
which reward baiting probabilities are fixed, but not in dynamic 
foraging tasks, in which the reward baiting probabilities change 
abruptly. Then, we devise two models that forget past experience
and exhibit matching behaviors even in dynamic tasks. Moreover,
these models can explain undermatching, which is a phenomenon 
observed across different species (Baum, 1974; de Villiers and 
Herrnstein, 1976; Baum, 1979; Gallistel, 1994; Anderson et al., 
2002; Sugrue et al., 2004; Lau and Glimcher, 2005). We test 
these models by comparing their predicted reward history depen- 
dencies and run-length distributions to those seen in a monkey 
experiment. 

2. RESULTS 

We studied deterministic Bayesian decision making models that 
demonstrated matching behaviors in a foraging task. The for- 
aging task is a decision making task that simulates a foraging 
environment where an animal chooses one out of several forag- 
ing alternatives. There are two alternatives in this study although 
our results do not depend on this. We employed discrete trial-to-trial
tasks that have often been used in recent experiments (Sugrue
et al., 2004; Corrado et al., 2005; Lau and Glimcher, 2005). Each
alternative has a binary baiting state $f_i$ ($i \in \{1, 2\}$ is the index of an
alternative), where $f_i = 1$ if a reward is baited and $f_i = 0$ otherwise.
If $f_i = 0$, a reward is baited ($f_i = 1$) at the beginning of each
trial with baiting probability $\lambda_i^t$, where $t$ is the trial number.
If the baiting probabilities are fixed across trials, the task
is called a static foraging task; otherwise it is called a dynamic
foraging task (Sugrue et al., 2004). Suppose that $r_i^t$ indicates whether
a subject receives a reward ($r_i^t = 1$) or not ($r_i^t = 0$), and $c_i^t$
indicates whether the subject chooses alternative $i$ ($c_i^t = 1$) or not
($c_i^t = 0$) in trial $t$. When the subject chooses a baited alternative,
i.e., $f_i = 1$ and $c_i^t = 1$, the baited reward is consumed ($f_i \leftarrow 0$).
This reward schedule is known as a "concurrent variable-interval 
schedule" (Baum and Rachlin, 1969). 
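The baiting rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulation code: the class name `BaitingTask`, the `step` method, and the `seed` argument are hypothetical, and alternatives are indexed from 0 rather than 1.

```python
import random


class BaitingTask:
    """Two-alternative foraging task with persistent baits (illustrative sketch)."""

    def __init__(self, baiting_probs, seed=0):
        self.baiting_probs = list(baiting_probs)      # lambda_i for each alternative
        self.baited = [0] * len(self.baiting_probs)   # f_i: 1 if a reward is waiting
        self.rng = random.Random(seed)

    def step(self, choice):
        """Bait empty alternatives, then deliver and consume the chosen bait."""
        for i, lam in enumerate(self.baiting_probs):
            if self.baited[i] == 0 and self.rng.random() < lam:
                self.baited[i] = 1
        reward = self.baited[choice]
        self.baited[choice] = 0   # the bait of the chosen alternative is consumed
        return reward
```

Because a bait on the unchosen alternative stays in place until that alternative is chosen, the probability of being rewarded for an alternative grows with the time since it was last visited, which is why exclusive choice is not optimal in this task.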

Whichever alternative the subject chooses in the foraging task, 
the choice can affect the reward probabilities of alternatives in 
the future. Therefore, the optimal strategy is not to exclusively 
choose the foraging alternative that has the highest baiting prob- 
ability. A behavioral strategy obeying the matching law is known 
to be nearly optimal for this task (Baum, 1981). Formally, the law 
states that 

$$\frac{R_1^t}{R_2^t} = \frac{C_1^t}{C_2^t}, \tag{1}$$

where $R_i^t$ and $C_i^t$ correspond to the total reward obtained from
alternative $i$ and the number of choices of alternative $i$ until
trial $t$. It is known that human and animal behaviors in these
kinds of tasks are well described by the generalized matching law
(Baum, 1974)

$$\log\left(\frac{R_1^t}{R_2^t}\right) = s \log\left(\frac{C_1^t}{C_2^t}\right) + \log k, \tag{2}$$

where $s$ is the sensitivity and $k$ is the bias. Equation (2) is equivalent to
Equation (1) if both $s$ and $k$ are unity.
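As an illustration of how the sensitivity and bias can be estimated, the following sketch fits the generalized matching law by ordinary least squares on the log ratios, with the reward log ratio regressed on the choice log ratio as in the figures. The function name `fit_generalized_matching` is hypothetical, and the sketch assumes that every block has nonzero reward and choice counts for both alternatives.

```python
import math


def fit_generalized_matching(rewards, choices):
    """Least-squares fit of log(R1/R2) = s * log(C1/C2) + log(k).

    rewards, choices: lists of (alt1, alt2) totals, one pair per block.
    Returns (s, k). All totals are assumed to be strictly positive.
    """
    xs = [math.log(c1 / c2) for c1, c2 in choices]
    ys = [math.log(r1 / r2) for r1, r2 in rewards]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s = sxy / sxx                    # slope: sensitivity
    k = math.exp(my - s * mx)        # intercept is log(k): bias
    return s, k
```

With data that obey Equation (1) exactly, the fit returns a sensitivity and a bias that are both equal to one.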

2.1. SIMPLE BERNOULLI ESTIMATORS 

First, we studied a simple normative Bayesian decision making 
model to clarify the underlying feasible computation for matching
behaviors. Suppose that a subject makes a decision simply
depending on its estimates of the reward probabilities for the
alternatives. The estimate can be formally described as

$$P_i^{t+1} = p(r_i^{t+1} = 1 \mid R^t, C^t), \tag{3}$$

where $R^t$ is the list of reward vectors $r^t = (r_1^t, r_2^t)$ from trials 1 to $t$
and $C^t$ is the list of choice vectors $c^t = (c_1^t, c_2^t)$ from trials 1 to $t$. The
model employs a winner-take-all (WTA) strategy, i.e., it chooses
the alternative that has the highest $P_i^{t+1}$. The model requires an
assumption about the reward assignment mechanism to estimate
$P_i^{t+1}$. One simple and conceivable assumption is that a choice is
rewarded according to a hidden reward probability $\mu_i^t$ that is independent
of the past reward and choice history, i.e., $p(r_i^t = 1) = \mu_i^t$.
This assumption is incorrect for our tasks, but we assume
that the model employs it and infers $\mu_i^t$ by Bayesian inference.
Hence, $P_i^{t+1}$ is given by the predictive distribution over $\mu_i^{t+1}$:

$$P_i^{t+1} = \int_0^1 d\mu\, \mu\, p(\mu_i^{t+1} = \mu \mid R^t, C^t). \tag{4}$$

Note that $p(\mu_i^{t+1} = \mu \mid R^t, C^t)$ can include the model's belief about
the change of $\mu_i$ between trials. Our first model assumes
that $\mu_i^t$ is time invariant, i.e.,
$p(\mu_i^{t+1} = \mu \mid R^t, C^t) = p(\mu_i^t = \mu \mid R^t, C^t)$.
The posterior distribution for an alternative is not
updated if the alternative is not chosen. If it is chosen, the
posterior distribution is updated as

$$\begin{aligned}
p(\mu_i^t = \mu \mid R^t, C^t) &\propto p(r_i^t \mid \mu_i^t = \mu)\, p(\mu_i^{t-1} = \mu \mid R^{t-1}, C^{t-1}) \\
&= \mu^{r_i^t} (1 - \mu)^{1 - r_i^t}\, p(\mu_i^{t-1} = \mu \mid R^{t-1}, C^{t-1}).
\end{aligned} \tag{5}$$

We employ the Beta prior, $p(\mu_i^0 = \mu) = \mathrm{Beta}(\mu \mid a, b)$, which is
conjugate to the likelihood. Note that we set the hyper-parameters
to $a = b = 1$ in all simulations to make the prior non-informative.
Therefore, the posterior becomes a Beta distribution:

$$p(\mu_i^t = \mu \mid R^t, C^t) = \mathrm{Beta}(\mu \mid R_i^t + a,\; C_i^t - R_i^t + b). \tag{6}$$

From Equations (4) and (6), we obtain

$$P_i^{t+1} = \frac{R_i^t + a}{C_i^t + a + b}. \tag{7}$$

This model is a natural extension of FBM (Yu and Cohen, 2009)
to the two-alternative choice task (for this reason, we will refer to
our model as FBM). Because of the WTA strategy, an alternative is
repeatedly chosen while its predictive distribution is higher than
that of the other. Because the empirical probability of reward for an
alternative converges to its baiting probability in repeated choices,
$P_i^t$ gradually approaches $\lambda_i$ and the variance of $P_i^t$ decreases. As
a result, FBM tends to choose exclusively the high payoff
alternative after a large number of observations. Hence, the matching
law [Equation (1)] is satisfied as $t \to \infty$ because such an exclusive
choice unboundedly increases both $R_i^t$ and $C_i^t$ of the high payoff
alternative.
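A minimal sketch of the FBM decision rule follows: the reward estimate for each alternative is the mean of its Beta posterior, and the choice is winner-take-all. The function names `fbm_predict` and `fbm_choose` are hypothetical, alternatives are indexed from 0, and breaking ties in favor of the lowest index is an assumption that the text does not specify.

```python
def fbm_predict(reward_total, choice_total, a=1.0, b=1.0):
    """Posterior-mean reward estimate under a Beta(a, b) prior."""
    return (reward_total + a) / (choice_total + a + b)


def fbm_choose(reward_totals, choice_totals, a=1.0, b=1.0):
    """Winner-take-all: index of the alternative with the highest estimate.

    Ties go to the lowest index (an assumption made for this sketch).
    """
    estimates = [fbm_predict(r, c, a, b)
                 for r, c in zip(reward_totals, choice_totals)]
    return max(range(len(estimates)), key=estimates.__getitem__)
```

For example, with 3 rewards in 4 choices of the first alternative and 1 reward in 4 choices of the second, the estimates are 4/6 and 2/6, so `fbm_choose([3, 1], [4, 4])` selects the first alternative.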
We simulated FBM in static and dynamic foraging tasks. The
time course of the predictive distributions is shown in Figure 1A.
As expected, both predictive distributions approach the
respective baiting probabilities, and FBM behavior converges to
exclusive choice of the high payoff alternative in static foraging
tasks. However, the steady-state choice behavior of animals in
static concurrent VI schedules has not been thought to be
exclusive (Baum, 1982; Davison and McCarthy, 1988; Baum et al.,
1999). It might be that there are not enough trials for choice
behavior to actually reach a steady state. Figures 1B,C plot the
log ratios of rewards and choices in both tasks. The marginal
histograms indicate FBM's strong preference for the alternative
that has the highest baiting probability, because most pairs of
log ratios lie near the endpoints of the matching line. By
least-squares fitting the generalized matching law [Equation (2)] to
the data, we found that the bias term $\log k$ is nearly zero and the
sensitivity is nearly one in the static foraging tasks (Figure 1B).
Therefore, the model exhibits matching behavior in the static
foraging tasks. However, the model no longer exhibits matching
behavior in dynamic foraging tasks (Figure 1C), a result that is
inconsistent with the behavior of monkeys (Corrado et al., 2005).
This is probably because the model adheres to past experience and
cannot adapt rapidly to changes in the environment.

FIGURE 1 | Simulation results for FBM. (A) Time course of predictive
distributions for alternatives #1 (red solid line) and #2 (blue solid line) in
static foraging task. Dashed lines indicate baiting probabilities of
alternatives #1 (dark red) and #2 (dark blue). Upper and lower dots
respectively represent choices of alternatives #1 and #2 in each trial, and
colored dots (red or blue) indicate that the model received a reward in
that trial. (B) Reward log ratios as a function of count log ratios in static
and (C) dynamic foraging tasks. Blue symbols represent pairs of log ratios
calculated in each block where baiting probabilities are fixed, and distributions of
dots are represented by marginal histograms. Red line indicates best-fitted
line to points and inner text shows its slope, i.e., sensitivity parameter of
generalized matching law. Dashed line is identity line.

2.2. EXTENDED BERNOULLI ESTIMATORS

One possible way of improving the model to enable it to rapidly
adapt to changes in the environment is to introduce a forgetting
mechanism for past rewards and choice history. We therefore
assume a simple extended model, which utilizes only the $L$ most
recent rewards and choices for the estimates. Hence, the predictive
distribution becomes

$$P_i^{t+1} = \frac{\left(\sum_{j=0}^{L-1} r_i^{t-j}\right) + a}{\left(\sum_{j=0}^{L-1} c_i^{t-j}\right) + a + b}. \tag{8}$$

We refer to this model as windowed FBM (WFBM).
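The windowed estimate can be sketched with a fixed-length buffer that holds the last $L$ trials for one alternative. The class name `WindowedEstimator` and its methods are hypothetical, and the sketch assumes that the window counts trials (not only the trials in which the alternative was chosen).

```python
from collections import deque


class WindowedEstimator:
    """Reward estimate for one alternative from the last `window` trials."""

    def __init__(self, window, a=1.0, b=1.0):
        # Each entry is (chosen, reward) for one trial; old trials drop out.
        self.history = deque(maxlen=window)
        self.a, self.b = a, b

    def update(self, chosen, reward):
        self.history.append((chosen, reward))

    def predict(self):
        n_chosen = sum(c for c, _ in self.history)
        n_reward = sum(r for c, r in self.history if c)
        return (n_reward + self.a) / (n_chosen + self.a + self.b)
```

One estimator would be kept per alternative, and the WTA rule would compare their `predict()` values. When an alternative has not been chosen within the window, the estimate returns to the prior mean $a / (a + b)$.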

Another possibility may be derived from the idea that humans
and animals may innately believe their environment is volatile.
Here, we propose a model that estimates time-varying reward
probabilities. Although there are several ways to model a belief
in a volatile environment, we assume our model believes that $\mu_i^t$
remains unchanged with probability $\alpha$, or else (with probability
$1 - \alpha$) changes completely. This idea is derived from the dynamic
belief model (DBM), proposed by Yu and Cohen as a model of
sequential effects (Yu and Cohen, 2009). Our model is a natural
extension of DBM to a two-alternative choice task. Thus, we
refer to our model as DBM. The transition of $\mu_i^t$ is modeled as a
mixture of the posterior and prior distributions

$$p(\mu_i^{t+1} = \mu \mid R^t, C^t) = \alpha\, p(\mu_i^t = \mu \mid R^t, C^t) + (1 - \alpha)\, \mathrm{Beta}(\mu \mid a, b), \tag{9}$$

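where $0 \le \alpha \le 1$ represents the model's expectation of the
stability of the environment. However, the posterior distribution is
no longer a Beta distribution:

$$\begin{aligned}
p(\mu_i^t = \mu \mid R^t, C^t)
&= p(\mu_i^t = \mu \mid r_i^t, c_i^t = 1, R^{t-1}, C^{t-1}) \\
&= \left(\frac{p(r_i^t = 1 \mid \mu_i^t = \mu)}{p(r_i^t = 1 \mid R^{t-1}, C^{t-1})}\right)^{r_i^t}
   \left(\frac{p(r_i^t = 0 \mid \mu_i^t = \mu)}{p(r_i^t = 0 \mid R^{t-1}, C^{t-1})}\right)^{1 - r_i^t}
   p(\mu_i^t = \mu \mid R^{t-1}, C^{t-1}) \\
&= \left(\frac{\mu}{P_i^t}\right)^{r_i^t} \left(\frac{1 - \mu}{1 - P_i^t}\right)^{1 - r_i^t}
   p(\mu_i^t = \mu \mid R^{t-1}, C^{t-1}),
\end{aligned} \tag{10}$$

where we use Equation (3). Then, the predictive distribution $P_i^{t+1}$ is
calculated with Equations (4), (9), and (10). Note that these models
are equivalent to FBM when $L \to \infty$ and $\alpha = 1$.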
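Because the DBM posterior is no longer a Beta distribution, one simple way to compute it numerically is to discretize the reward probability on a grid. The sketch below is an assumption-laden illustration, not the authors' implementation: the function names `dbm_step` and `expected_value`, the grid representation, and the argument layout are all hypothetical.

```python
def dbm_step(belief, grid, prior, alpha, chosen, reward):
    """One trial of a grid-approximated DBM update for a single alternative.

    belief: weights for p(mu^t | history up to t-1) on `grid` (sums to 1).
    prior:  weights of the Beta(a, b) prior on the same grid (sums to 1).
    Returns (posterior, next_belief), each a list of weights summing to 1.
    """
    if chosen:
        # Bernoulli likelihood of the observed reward, then normalize.
        like = [mu if reward else 1.0 - mu for mu in grid]
        post = [l * w for l, w in zip(like, belief)]
        total = sum(post)
        post = [p / total for p in post]
    else:
        post = list(belief)  # no observation, so the posterior is unchanged
    # Mixture of posterior and prior: belief about mu on the next trial.
    nxt = [alpha * p + (1.0 - alpha) * q for p, q in zip(post, prior)]
    return post, nxt


def expected_value(weights, grid):
    """Mean of a discretized distribution (the predictive reward estimate)."""
    return sum(w * mu for w, mu in zip(weights, grid))
```

With a uniform grid prior standing in for $\mathrm{Beta}(1, 1)$, a rewarded choice moves the posterior mean to roughly 2/3, and each subsequent unchosen trial pulls the predictive estimate back toward the prior mean of 0.5 at rate $\alpha$, which matches the behavior described for the unchosen alternative.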

Figure 2 shows the time courses of the predictive distributions of
WFBM and DBM, and of the posterior distributions of DBM, in the
dynamic foraging task. Neither model gets stuck on one alternative,
and both follow the changes in schedules as expected. There is a
clear difference in the predictive distribution trajectories. Because
WFBM exploits recent samples, its predictive distribution for
the unchosen alternative can approach the true baiting probability.
DBM's predictive distribution for the unchosen alternative,
on the other hand, is only retracted to the mean of the prior,
i.e., 0.5. Both models demonstrate matching behaviors even in
the dynamic foraging task (Figure 3). More precisely, the behaviors
slightly deviate from the matching law toward an unbiased
choice. This phenomenon is known as undermatching (Baum,
1979). Because the models' parameters $L$ and $\alpha$ control the effect
of past experience, they also control the degree of undermatching.
The sensitivities that were fitted in the experiments
were in a range of about 0.44 to 0.91 (Hinson and Staddon,
1983; Corrado et al., 2005; Lau and Glimcher, 2005). Hence, we
basically focused on the parameter regions $10 < L$ and $0.9 < \alpha$.

FIGURE 2 | Simulation results for WFBM and DBM in dynamic foraging
task. Simulation parameters were set to $L = 60$ and $\alpha = 0.99$. (A) Time
course of predictive distributions of WFBM and (B) DBM. Details in figure
are described in caption of Figure 1A. (C) Time course of posterior
distributions of DBM for reward probabilities of alternative #1 (top) and #2
(bottom). Brown dashed lines are baiting probabilities for respective
alternatives.

The dependence of choices on reward history has been studied
in several monkey experiments. An exponentially shaped dependency
was first reported (Sugrue et al., 2004), and then heavier-tailed
dependencies were reported (Corrado et al., 2005; Lau and
Glimcher, 2005). We tested our models by calculating the dependence
of choices on reward history (Figure 4A). Suppose that the
dependency is expressed with a linear filter kernel $k(i)$ of length $K$,
as in previous studies. The kernel is calculated by minimizing the
following squared error, which leads to the Wiener-Hopf equation,

$$\sum_t \left[ (c_1^t - c_2^t) - \sum_{i=1}^{K} k(i)\,(r_1^{t-i} - r_2^{t-i}) \right]^2. \tag{11}$$



Then, we fit the exponential filter and double-exponential filter
that were introduced by Corrado et al. (2005) to the normalized
kernel:

$$\begin{aligned}
\epsilon_1(i) &= \frac{\exp(-i/\tau_0)}{\sum_{k=1}^{K} \exp(-k/\tau_0)}, \\
\epsilon_2(i) &= p\, \frac{\exp(-i/\tau_1)}{\sum_{k=1}^{K} \exp(-k/\tau_1)}
 + (1 - p)\, \frac{\exp(-i/\tau_2)}{\sum_{k=1}^{K} \exp(-k/\tau_2)},
\end{aligned} \tag{12}$$

where $\tau_0$ and $\tau_1 \le \tau_2$ are time constants and $0 < p < 1$ is the
combining rate. Note that $\epsilon_2$ is identical to $\epsilon_1$ when $\tau_1 = \tau_2$.
The double-exponential filter fits better than
the single one for both WFBM and DBM (likelihood ratio test,
$p \ll 0.001$; adjusted $r^2$ for the double and single exponential
filters are 0.99 and 0.98 for WFBM, and 0.94 and 0.85 for
DBM). The kernel for WFBM has a negative value around lag $L$,
but it disappears if $L$ is much longer than $K$. The kernel for
DBM drops sharply and then decays slowly. The sharp drop
probably arises from the exponential decay of reward history, which
is embedded in the posterior distributions [Equation (10)].
Because a decision is made on the basis of the difference between
two predictive distributions and both distributions decay at the same
rate, the effect of one predictive distribution persists
slightly longer, and hence the kernel includes a longer
exponential component. This characteristic is qualitatively
consistent with the experimental results of Corrado et al. (2005).
The fitting parameters for the two monkeys in Corrado et al.
(2005) were $p = 0.4$, $\tau_1 = 2.2$, and $\tau_2 = 17.0$ (monkey F),
and $p = 0.25$, $\tau_1 = 0.9$, and $\tau_2 = 12.6$ (monkey G). Although
no WFBM or DBM parameters exactly reproduced the fitting
parameters of the monkeys, similar values were obtained for
smaller $L$ and larger $\alpha$ (Figure 4B).

FIGURE 3 | Analytical results for matching behavior of WFBM and
DBM in dynamic foraging tasks. (A,B) Reward log ratios as a function of
count log ratios. Details in figure are described in caption of Figures 1B,C.
Simulation parameters were set to $L = 40$ and $\alpha = 0.99$. (C,D) Sensitivity as
a function of parameters of WFBM and DBM.
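The single and double exponential filters that are fitted to the kernel can be written down directly. The sketch below only evaluates the two filter shapes over lags $1, \dots, K$; the function names `exp_filter` and `double_exp_filter` are hypothetical, and the fitting procedure itself is not shown.

```python
import math


def exp_filter(tau, K):
    """Normalized single-exponential kernel over lags 1..K."""
    raw = [math.exp(-i / tau) for i in range(1, K + 1)]
    total = sum(raw)
    return [x / total for x in raw]


def double_exp_filter(p, tau1, tau2, K):
    """Mixture of two normalized exponentials with combining rate p."""
    fast, slow = exp_filter(tau1, K), exp_filter(tau2, K)
    return [p * x + (1.0 - p) * y for x, y in zip(fast, slow)]
```

Each component is normalized separately, so the mixture also sums to one for any combining rate, and setting `tau1 == tau2` reduces the double-exponential filter to the single one. For instance, `double_exp_filter(0.4, 2.2, 17.0, K)` evaluates the shape reported for monkey F, where the kernel length `K` must be supplied by the user because the text does not state it.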

FIGURE 4 | Results of Wiener-Hopf analysis for WFBM and DBM in
dynamic foraging task. (A) Symbols represent normalized Wiener-Hopf
kernel and red line represents best fitted double-exponential filter.
Double-exponential filters are better fitted to data than single-exponential
filter (likelihood ratio test, $p \ll 0.001$). Insets show time constants for each
exponential component and their combining rate. Simulation parameters
were set to $L = 60$ and $\alpha = 0.99$. (B) Fitted parameters of
double-exponential filter $p$, $\tau_1$, and $\tau_2$ to simulation data of WFBM (left
column) and DBM (right column). Abscissas represent parameters of
WFBM or DBM.

It is known that the probability of switching alternatives is
nearly constant against the number of consecutive choices of one
alternative (run length) in the concurrent VI schedule (Heyman
and Luce, 1979). Hence, run lengths are distributed exponentially,
but in a dynamic foraging task the distribution seems to be a
mixture of exponentials (Corrado et al., 2005). The run-length
distribution of WFBM does not decrease monotonically, and there is a
peak where the run length is nearly equal to $L$. Therefore, the
distribution is neither an exponential nor a mixture of exponentials.
This property holds for different values of $L$. DBM, however,
demonstrates an exponential-like distribution. We fitted single
and double exponential functions,

$$\begin{aligned}
\phi_1(l) &= \nu_0 \exp(-\nu_0 (l - 1)), \\
\phi_2(l) &= \gamma \nu_1 \exp(-\nu_1 (l - 1)) + (1 - \gamma) \nu_2 \exp(-\nu_2 (l - 1)),
\end{aligned} \tag{13}$$

to the distribution, where $l \ge 1$ is the run length, $\nu_0$ and $\nu_1 < \nu_2$
are the rate parameters, and $\gamma$ is the combining rate. The
distribution is well fitted by the double exponential function (Figure 5B;
likelihood ratio test, $p \ll 0.001$; $r^2$ is 0.99 for the double and
0.96 for the single exponential function). In monkey experiments,
very short run lengths are rare; however, our models
have the highest frequency at a run length of 1 (Figures 5A,B).
This difference may be due to the absence of change-over-delay
(COD) in our schedule. If our model had, and exploited, prior
knowledge about COD, as did the model proposed for the
previous experiment (Corrado et al., 2005), the peak at a
run length of 1 could disappear. We simulated linear-nonlinear-Poisson
(LNP) models that were fitted to the monkeys' experimental
data in Corrado et al. (2005) and compared run-length
distributions (Figure 5C). Note that, unlike in the approach of
Corrado et al. (2005), COD was not considered for the LNP models.
Because the absence of COD could affect
the occurrence of short run lengths, log probability densities
were compared so that differences at long run lengths were taken
into account. The calculated mean squared differences of DBM
against the LNP models for the two monkeys were approximately
0.67 and 0.16. The double-exponential function fits better than
the single one for different values of $\alpha$, and the fitted
parameters are only slightly affected by $\alpha$ (Figure 5D).

FIGURE 5 | Run-length distributions of windowed FBM and DBM in
dynamic foraging task. Simulation parameters were set to $L = 60$ and
$\alpha = 0.99$. (A,B) Bars represent densities of run length for alternative #1.
Single (green line) and double-exponential (red line) functions fitted to
run-length distributions of DBM. Double-exponential function is fitted better
than single one (likelihood ratio test, $p \ll 0.001$). (C) Log probability density
of run-length distribution of DBM (black line) and linear-nonlinear Poisson
models (red and green lines) that are fitted to monkeys' experimental data
in Corrado et al. (2005). (D) Fitted parameters of double-exponential
function with different values of $\alpha$. Left ordinate indicates value of rate
parameters $\nu_1$ (green line) and $\nu_2$ (blue line), and right indicates value of
combining rate $\gamma$ (red line).
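Run lengths can be extracted from a simulated choice sequence by grouping consecutive identical choices. The following sketch also evaluates the single and double exponential functions of Equation (13); the function names `run_lengths`, `phi1`, and `phi2` are hypothetical, and the fitting step is not shown.

```python
import math
from itertools import groupby


def run_lengths(choices, alternative):
    """Lengths of consecutive runs of `alternative` in a choice sequence."""
    return [len(list(group))
            for value, group in groupby(choices) if value == alternative]


def phi1(l, nu0):
    """Single exponential function of run length l >= 1."""
    return nu0 * math.exp(-nu0 * (l - 1))


def phi2(l, gamma, nu1, nu2):
    """Mixture of two exponentials with combining rate gamma."""
    return (gamma * nu1 * math.exp(-nu1 * (l - 1))
            + (1.0 - gamma) * nu2 * math.exp(-nu2 * (l - 1)))
```

For the sequence `[1, 1, 2, 1, 2, 2, 2, 1]`, the runs of alternative 1 have lengths 2, 1, and 1, and the runs of alternative 2 have lengths 1 and 3. A histogram of these lengths over a long simulation gives the run-length distribution to which the two functions are fitted.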

2.2.1. Harvesting performance

Figure 6A compares the harvesting performance of the models, 
which is normalized by the performance of a near-optimal prob- 
abilistic decision making model. The near-optimal model knows 
the details of the schedules, i.e., both the baiting probabilities 
and the change points. It distributes its choices according to the 
choice probabilities that on average maximize the total reward 
(Sakai and Fukai, 2008a). Because of this given knowledge, none of
the other models can exceed the performance of the near-optimal
model. We carried out paired t-tests between the models, in which
the means of total reward for an identical schedule were paired.
FBM and WFBM ($L = 60$) perform worse than the random
choice model, which chooses by tossing an unbiased coin.
DBM ($\alpha = 0.99$) outperforms the FBM, WFBM, and LNP models
($p \ll 0.001$), but the differences from the LNP models are very
small. Harvesting performance decreases as a model memorizes a
more distant past (Figure 6B).

3. DISCUSSION 

We demonstrated that deterministic Bayesian decision making 
models can account for the matching law. We confirmed that a 
simple Bernoulli estimator with a deterministic decision policy 
demonstrated matching behavior in a static foraging task. We also 
studied an extended model that includes a belief about a chang- 
ing environment. The belief effectively works to wipe out the past 
experience of the model and hence the model can capture three 
characteristics of behaviors observed in the experiments. First, 
our model accounts for undermatching, which is a well-known 
phenomenon in which choices deviate slightly from the matching 
law (Baum, 1974, 1979; Sugrue et al., 2004). Several studies have
addressed possible causes of undermatching, i.e., limitations in
the learning rule (Soltani and Wang, 2006), mistuning of
parameters (Loewenstein, 2008), and diffusion of synaptic weights
(Katahira et al., 2012). This study suggests a cause from a
computational perspective, i.e., undermatching is the consequence
of a belief in environmental volatility. Second, our model exhibits
double-exponential shaped reward history dependency. This is
consistent with recent monkey experiments (Corrado et al., 2005; 
Lau and Glimcher, 2005). Third, the run-length distribution of 
our model is better fitted by a double-exponential function than 
a single exponential function. This is also consistent with the 
previous study (Corrado et al., 2005) although our task did not 
include changeover delay, which can strongly affect the frequency 
of shorter run lengths. Quantitatively validating our model, for
example by checking its goodness of fit to raw experimental data,
would be worthwhile.

The previous models implicitly or explicitly use the strategy of 
probabilistic choice selection and they learn the choice probability 
of respective alternatives that satisfy the matching law (Corrado 
et al., 2005; Lau and Glimcher, 2005; Loewenstein and Seung, 
2006; Soltani and Wang, 2006; Sakai and Fukai, 2008a; Simen and 
Cohen, 2009). Such probabilistic models use a scaling parameter 
that maps internal decision variables to appropriate choice prob- 
abilities and the parameter generally requires fine-tuning (Soltani 
and Wang, 2006; Fusi et al., 2007). In contrast, as our models 
act deterministically according to decision variables, no tuning is 
required for a parameter at the decision stage. 

We argued that matching behavior can be explained by a deter- 
ministic choice strategy at the computational level. Loewenstein 
and Seung (2006) proposed biologically inspired synaptic learn- 
ing rules for neural networks at the neural implementation level. 
They proved that neural networks developed by covariance-based 
learning with the assumption of a low learning rate demonstrated 
matching behaviors. However, this assumption causes the choice 
to be affected by relatively distant past rewards and the kernel 
for reward history dependency consequently flattens. A more 
microscopic spiking neural network model, in which double- 
exponential dependency in foraging tasks is demonstrated, has 
been proposed (Soltani and Wang, 2006). However, there is a 
huge gap between the computational principles of our determin- 
istic macroscopic models and their stochastic microscopic model. 
This gap can be filled by using a method of reducing spiking neu- 
ron models to the diffusion equation (Roxin and Ledberg, 2008). 
There have been some other neural network models that can show 
heavy-tailed dependency of choices on past experience. A reser- 
voir network (Jaeger et al., 2007), which can reproduce neural 
activity in the monkey prefrontal cortex, preserves the memory 
trace of a reward with one or two time constants (Bernacchia 
et al., 2011). The composite learning system of faster and slower 
components is flexible to abrupt changes in the environment 
(Fusi et al., 2007). These models could be a possible neural 
implementation for our model. Furthermore, our models are 
an extension of that by Yu & Cohen who argued that decision 
variables of their model can be approximated by a linear expo- 
nential filter, and that there are neural implementations for that 
operation (Yu and Cohen, 2009). 

Because matching behavior often deviates from optimal 
behavior in the sense of total reward maximization (Vaughan, 
1981), it is not likely to be a consequence of optimization. 
However, our model acts optimally in terms of Bayesian decision 
making with an incorrect assumption about the environment, 
indicating that matching behavior is a bounded optimal behavior. 
This idea is consistent with the theory of Sakai and Fukai (2008b),
who found that any learning method neglecting the effect of a choice
on future rewards displays matching behavior if choice
probabilities are differentiable with respect to parameters.
Note that the choice probabilities of our model are
not differentiable. Hence, our results suggest that their theory
could also be correct in such extreme cases.

FIGURE 6 | Normalized harvesting performance of each model. (A)
Average normalized total rewards earned by each model divided by average
total rewards of near-optimal model. Near-optimal model uses strategy that
maximizes average total rewards proposed by Sakai and Fukai (2008a) with
previous knowledge on details of schedule. Error bars indicate standard
deviations around mean. Simulation parameters were set to $L = 60$ and
$\alpha = 0.99$. (B) Harvesting performance of WFBM and DBM as a function of
their parameters. ***$p \ll 0.001$.

4. MATERIALS AND METHODS 
4.1. DETAILS OF SIMULATION 

The reward schedule is analogous to that of the experiment by Corrado
et al. (2005). In the static setting, we randomly set baiting
probabilities that satisfied $\lambda_1 + \lambda_2 = 0.3$, with ratios of 1:8,
1:6, 1:3, 1:2, 1:1, 2:1, 3:1, 6:1, and 8:1. There were 10,000 trials
in the simulations. The baiting schedule in the dynamic setting
was divided into blocks, in which the baiting probabilities were
fixed, and their sum and ratios were the same as those in the static
setting. The block length was uniformly sampled from [50, 300]
and there were 300 blocks in the simulations. We did not include
change-over-delay (COD), i.e., the cost to switch from one
alternative to another, unlike Corrado et al. (2005).
The hyper-parameters were set to $a = 1$ and $b = 1$ in all the
simulations.
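The dynamic baiting schedule described above could be generated along the following lines. This is a sketch under stated assumptions: the function name `make_dynamic_schedule` and the `seed` argument are hypothetical, the ratio for each block is assumed to be drawn independently and uniformly from the listed set, and block lengths are assumed to be integers drawn uniformly from 50 to 300 inclusive.

```python
import random

# Ratios lambda_1 : lambda_2 listed in the text.
RATIOS = [(1, 8), (1, 6), (1, 3), (1, 2), (1, 1), (2, 1), (3, 1), (6, 1), (8, 1)]


def make_dynamic_schedule(n_blocks=300, total=0.3, seed=0):
    """Return a list of (block_length, (lambda1, lambda2)) tuples."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_blocks):
        r1, r2 = rng.choice(RATIOS)          # assumed: independent per block
        lam1 = total * r1 / (r1 + r2)        # split the fixed sum by the ratio
        length = rng.randint(50, 300)        # inclusive on both ends
        schedule.append((length, (lam1, total - lam1)))
    return schedule
```

A static schedule corresponds to a single block of 10,000 trials with one of the listed ratios.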